feat: add HttpAgent, per-step evaluation, and lightweight trace export #118
Merged
Three platform infrastructure features:

1. HttpAgent (agents/http_agent.py): Generic agent-as-HTTP-service that forwards observations to any remote endpoint and parses BenchmarkAction responses. Enables teams to deploy custom agent stacks (model + prompt + parsing) as black-box HTTP servers, cleanly solving GPU/CPU separation.

2. Per-step evaluation in RLEnvironment: New evaluate_every_step parameter calls the WAA evaluator after each step and populates info["evaluation_score"]. Does NOT change the reward signal; training code decides how to use it. Useful for online RL training loops.

3. LightweightTraceExporter: Plain JSON + screenshots trace export with no openadapt-ml dependency. Produces episode JSON, manifest, and JSONL training samples in a universal format.

All 34 new tests pass. 984 existing tests unaffected.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
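The per-step evaluation behaviour described above can be illustrated with a minimal sketch. This is not the PR's RLEnvironment: `StepEvalWrapper`, `step_fn`, and `evaluate_fn` are hypothetical names, and recording failures under `info["evaluation_error"]` is an assumption. Only `evaluate_every_step` and `info["evaluation_score"]` come from the description.

```python
from typing import Any, Callable, Dict, Tuple

StepResult = Tuple[Any, float, bool, Dict[str, Any]]


class StepEvalWrapper:
    """Hypothetical stand-in for an environment with per-step evaluation."""

    def __init__(
        self,
        step_fn: Callable[[Any], StepResult],
        evaluate_fn: Callable[[], float],
        evaluate_every_step: bool = False,
    ) -> None:
        self._step_fn = step_fn          # performs one environment step
        self._evaluate_fn = evaluate_fn  # stands in for the task evaluator
        self._evaluate_every_step = evaluate_every_step

    def step(self, action: Any) -> StepResult:
        obs, reward, done, info = self._step_fn(action)
        if self._evaluate_every_step:
            try:
                # The score is attached to info only; reward is left untouched.
                info["evaluation_score"] = float(self._evaluate_fn())
            except Exception as exc:
                # Assumption: a failed evaluation is recorded, never raised.
                info["evaluation_error"] = str(exc)
        return obs, reward, done, info


# Usage: the reward stays as the environment returned it.
env = StepEvalWrapper(
    step_fn=lambda action: ("obs", 0.0, False, {}),
    evaluate_fn=lambda: 0.5,
    evaluate_every_step=True,
)
obs, reward, done, info = env.step("click")
```

Keeping the score in `info` rather than in the reward leaves reward shaping to the training code, which matches the stated intent.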
95ebea5 to c12e097
This was referenced Mar 16, 2026
Summary
Three platform infrastructure features for generalizable agent integration:
1. HttpAgent (agents/http_agent.py): Generic agent-as-HTTP-service. Any team can deploy their agent stack as an HTTP server: the orchestrator sends {screenshot, instruction, viewport} to POST /act and gets back a BenchmarkAction. Cleanly solves the GPU/CPU separation problem (models need GPUs, WAA VMs need nested-virt CPU instances). Includes health check, graceful error handling, and optional /reset notification.

2. Per-step evaluation in RLEnvironment: New evaluate_every_step=True parameter calls the WAA evaluator after each step and populates info["evaluation_score"]. The reward signal is NOT changed (stays 0.0 mid-episode); training code decides how to use the per-step evaluation data. Evaluation errors are caught gracefully.

3. LightweightTraceExporter: Plain JSON + screenshots trace export with no openadapt-ml dependency. Produces episode JSON files, manifest, and JSONL training samples in a universal format that any training pipeline can consume.

Test plan
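For reference, the POST /act exchange from the summary can be sketched from the orchestrator side using only the standard library. This is an illustration, not the code in agents/http_agent.py. The function names, the base64 screenshot encoding, the viewport-as-list encoding, the dict-shaped action with a "type" key, and the fallback "error" action are all assumptions; only the three payload fields and the /act route come from the summary.

```python
import base64
import json
import urllib.request
from typing import Any, Dict, Tuple


def build_act_request(
    base_url: str,
    screenshot_png: bytes,
    instruction: str,
    viewport: Tuple[int, int],
) -> urllib.request.Request:
    """Build the POST /act request (payload encoding is assumed)."""
    payload = {
        "screenshot": base64.b64encode(screenshot_png).decode("ascii"),
        "instruction": instruction,
        "viewport": list(viewport),
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/act",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_action(body: bytes) -> Dict[str, Any]:
    """Parse the agent's reply; malformed replies become an 'error' action."""
    try:
        action = json.loads(body.decode("utf-8"))
    except (UnicodeDecodeError, json.JSONDecodeError):
        return {"type": "error", "reason": "invalid JSON from agent"}
    if not isinstance(action, dict) or "type" not in action:
        return {"type": "error", "reason": "missing action type"}
    return action


def act(
    base_url: str,
    screenshot_png: bytes,
    instruction: str,
    viewport: Tuple[int, int],
    timeout: float = 30.0,
) -> Dict[str, Any]:
    """Send one observation and return the parsed action, never raising on I/O."""
    request = build_act_request(base_url, screenshot_png, instruction, viewport)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return parse_action(response.read())
    except OSError as exc:  # URLError, HTTPError, and timeouts derive from OSError
        return {"type": "error", "reason": str(exc)}
```

Returning a placeholder action instead of raising is one way to read "graceful error handling": a single bad response then does not abort the whole episode.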
🤖 Generated with Claude Code